## Task 1: Plotting in ggplot (Focus: aes, lab, scales)
Goal: understand the plotting syntax of ggplot. What types of plots are there, what types of aesthetics, and which commands are useful for editing plots? Source: Huber book, Chapter 3 (High Quality Graphics in R).
‘plot’ is the most basic plotting function. Example:
head(DNase) # DNase is an example data frame that already ships with R (DNase is an enzyme that degrades DNA) --> we can easily plot concentration vs. density
## Run conc density
## 1 1 0.04882812 0.017
## 2 1 0.04882812 0.018
## 3 1 0.19531250 0.121
## 4 1 0.19531250 0.124
## 5 1 0.39062500 0.206
## 6 1 0.39062500 0.215
plot(DNase$conc, DNase$density)
# plot can be customized by e.g. changing plot symbol and axis labels:
plot(DNase$conc, DNase$density,
ylab = attr(DNase, "labels")$y,
xlab = paste(attr(DNase, "labels")$x, attr(DNase, "units")$x),
pch = 3,
col = "blue")
# or use a histogram or a boxplot:
hist(DNase$density, breaks=25, main="")
boxplot(density ~ Run, data=DNase)
Loading the package and redoing the simple plot:
library("ggplot2")
## Warning: package 'ggplot2' was built under R version 4.3.1
ggplot(DNase, aes(x=conc, y=density))+geom_point()
# What do all those parts mean?
# first: the data frame (DNase)
# second: the aes (aesthetic) argument: which variables are mapped to the x- and y-axis
# third: the geometry: we want to draw points, so we add geom_point()
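Since the focus of this task is on aes, labs and scales, here is a small sketch that adds axis labels and a logarithmic x scale to the DNase plot. The label wording and the choice of scale_x_log10() are my own illustration and are not taken from the book.

```r
library(ggplot2)

# DNase ships with R (datasets package); conc is strictly positive, so a log scale is safe
p <- ggplot(DNase, aes(x = conc, y = density)) +   # data + aesthetic mapping
  geom_point(shape = 3, colour = "blue") +          # geometry; fixed (not mapped) shape and colour
  scale_x_log10() +                                 # scale: log10-transform the x axis
  labs(x = "DNase concentration (ng/ml)",           # axis labels (assumed wording)
       y = "Optical density",
       title = "DNase: density vs. concentration")

print(p)
```

Values set inside aes() are mapped from the data, while values set directly in geom_point() (here shape and colour) apply to all points.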
The book uses a dataset that I did not manage to download, so let’s just use the dataset from last week:
gapminder_raw <- read.csv("https://raw.githubusercontent.com/resbaz/r-novice-gapminder-files/master/data/gapminder-FiveYearData.csv")
head(gapminder_raw)
## country year pop continent lifeExp gdpPercap
## 1 Afghanistan 1952 8425333 Asia 28.801 779.4453
## 2 Afghanistan 1957 9240934 Asia 30.332 820.8530
## 3 Afghanistan 1962 10267083 Asia 31.997 853.1007
## 4 Afghanistan 1967 11537966 Asia 34.020 836.1971
## 5 Afghanistan 1972 13079460 Asia 36.088 739.9811
## 6 Afghanistan 1977 14880372 Asia 38.438 786.1134
library(tidyverse)
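## Warning: package 'tidyverse' was built under R version 4.3.1
## Warning: package 'tibble' was built under R version 4.3.1
## Warning: package 'tidyr' was built under R version 4.3.1
## Warning: package 'readr' was built under R version 4.3.1
## Warning: package 'purrr' was built under R version 4.3.1
## Warning: package 'dplyr' was built under R version 4.3.1
## Warning: package 'forcats' was built under R version 4.3.1
## Warning: package 'lubridate' was built under R version 4.3.1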
## ── Attaching core tidyverse packages ──────────────────────── tidyverse 2.0.0 ──
## ✔ dplyr 1.1.2 ✔ readr 2.1.4
## ✔ forcats 1.0.0 ✔ stringr 1.5.0
## ✔ lubridate 1.9.2 ✔ tibble 3.2.1
## ✔ purrr 1.0.1 ✔ tidyr 1.3.0
## ── Conflicts ────────────────────────────────────────── tidyverse_conflicts() ──
## ✖ lubridate::%within%() masks IRanges::%within%()
## ✖ dplyr::collapse() masks IRanges::collapse()
## ✖ dplyr::combine() masks BiocGenerics::combine()
## ✖ dplyr::desc() masks IRanges::desc()
## ✖ tidyr::expand() masks S4Vectors::expand()
## ✖ dplyr::filter() masks stats::filter()
## ✖ dplyr::first() masks S4Vectors::first()
## ✖ dplyr::lag() masks stats::lag()
## ✖ ggplot2::Position() masks BiocGenerics::Position(), base::Position()
## ✖ purrr::reduce() masks GenomicRanges::reduce(), IRanges::reduce()
## ✖ dplyr::rename() masks S4Vectors::rename()
## ✖ lubridate::second() masks S4Vectors::second()
## ✖ lubridate::second<-() masks S4Vectors::second<-()
## ✖ dplyr::slice() masks IRanges::slice()
## ℹ Use the conflicted package (<http://conflicted.r-lib.org/>) to force all conflicts to become errors
# using only the "Americas" in the year 2007
# geom_bar: each data item is represented by a bar
# stat = "identity" -> use the values as they are (otherwise geom_bar would compute a histogram of the data; the default is stat = "count")
gapminder_raw %>% filter(continent == "Americas", year == 2007) %>%
  ggplot(aes(x=country, y=lifeExp))+geom_bar(stat="identity")
Adding color and rotating the axis labels by 90 degrees:
gg <- gapminder_raw %>%
  filter(continent == "Americas", year == 2007) %>%
  ggplot(aes(x = country, y = lifeExp, fill = country)) +
  geom_bar(stat = "identity") +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
print(gg)
Above we named our plot gg; now we save it as Beispiel.pdf. The form of the call is prescribed by R:
ggplot2::ggsave("Beispiel.pdf", plot=gg)
## Saving 7 x 5 in image
The components of a ggplot (the grammar of graphics) are:
1. one or more datasets,
2. one or more geometric objects that serve as the visual representations of the data, for instance points, lines, rectangles, contours,
3. descriptions of how the variables in the data are mapped to visual properties (aesthetics) of the geometric objects, and an associated scale (e.g., linear, logarithmic, rank),
4. one or more coordinate systems,
5. statistical summarization rules,
6. a facet specification, i.e. the use of multiple similar subplots to look at subsets of the same data,
7. optional parameters that affect the layout and rendering, such as text size, font and alignment, legend positions.
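To see how the components listed above fit together, here is a minimal sketch that uses each of them once. It uses mpg, an example dataset that ships with ggplot2; the dataset and all chosen settings are assumptions made purely for illustration.

```r
library(ggplot2)

# The numbers in the comments refer to the seven components of the list
p <- ggplot(mpg, aes(x = displ, y = hwy, colour = class)) +  # 1 dataset, 3 aesthetic mapping
  geom_point(alpha = 0.6) +                                   # 2 geometric object
  stat_smooth(aes(group = 1), method = "lm", formula = y ~ x,
              se = FALSE, colour = "black") +                 # 5 statistical summary
  scale_y_log10() +                                           # 3 scale of the y aesthetic
  coord_cartesian(xlim = c(1, 7)) +                           # 4 coordinate system
  facet_wrap(~ drv) +                                         # 6 facets
  theme(legend.position = "bottom")                           # 7 layout and rendering

print(p)
```

Leaving out the scale, coordinate, facet and theme lines still gives a valid plot, because ggplot2 falls back to defaults for them.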
--> a ggplot needs at least parts 1-3 --> parts 4-7 are optional
## Visualizing data in 1D
## Barplots
genes <- data.frame(
gene = rep(c("GeneA", "GeneB", "GeneC", "GeneD", "GeneE"), each = 20),
value = c(rnorm(20, mean = 10), rnorm(20, mean = 20),
rnorm(20, mean = 30), rnorm(20, mean = 40),
rnorm(20, mean = 50)))
library("ggplot2")
#install.packages("Hmisc")
library("Hmisc")
## Warning: package 'Hmisc' was built under R version 4.3.1
##
## Attaching package: 'Hmisc'
## The following objects are masked from 'package:dplyr':
##
## src, summarize
## The following objects are masked from 'package:base':
##
## format.pval, units
ggplot(genes, aes( x = gene, y = value, fill = gene)) +
stat_summary(fun = mean, geom = "bar") +
stat_summary(fun.data = mean_cl_normal, geom = "errorbar",
width = 0.25)
# mean_cl_normal computes the mean together with its confidence limits (95% by default, based on the t distribution)
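To make clear what the error bars show, here is a base R sketch of the quantity that mean_cl_normal returns: the mean (y) and the lower and upper confidence limits (ymin, ymax). mean_cl_normal wraps Hmisc::smean.cl.normal; the 95% level is its default, and the simulated data and the seed are my own assumptions.

```r
set.seed(1)                        # arbitrary seed so the sketch is reproducible
x <- rnorm(20, mean = 10)          # same shape as one gene in the genes data frame

n      <- length(x)
se     <- sd(x) / sqrt(n)          # standard error of the mean
t_crit <- qt(0.975, df = n - 1)    # two-sided 95% quantile of the t distribution

# mean and confidence limits, named like the columns mean_cl_normal returns
ci <- c(y = mean(x), ymin = mean(x) - t_crit * se, ymax = mean(x) + t_crit * se)
ci
```

The error bars are therefore wider than plus/minus one standard error: with 20 values per gene the half-width is roughly twice the standard error.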
## Boxplots
Boxplots are much more informative than barplots:
p= ggplot(genes, aes(x=gene, y=value, fill=gene))
p + geom_boxplot()
# especially useful for larger datasets -> you can see whether the distribution is right- or left-skewed
## Dot plots and beeswarm plots
p + geom_dotplot(binaxis = "y", binwidth = 1/6,
stackdir = "center", stackratio = 0.75,
aes(color = gene))
#install.packages("ggbeeswarm")
library("ggbeeswarm")
## Warning: package 'ggbeeswarm' was built under R version 4.3.1
p + geom_beeswarm(aes(color = gene))
# datasets often contain many overlapping points -> ggbeeswarm arranges them so that they do not overlap
## Density plots
# fairly self-explanatory, often very practical (e.g. for comparisons)
ggplot(genes, aes(x=value, color=gene)) + geom_density()
## Violin plots
Violin plots give a clearer overview and are inspired by boxplots --> their symmetric shapes resemble violins
p + geom_violin()
## Ridgeline plots
Ridgeline plots also visualize densities and are especially useful for large datasets (clearer layout; differences and trends are easy to see).
#install.packages("ggridges")
library("ggridges")
## Warning: package 'ggridges' was built under R version 4.3.1
ggplot(genes, aes(x = value, y = gene, fill = gene)) +
geom_density_ridges()
## Picking joint bandwidth of 0.424
## ECDF plots
ECDF -> empirical cumulative distribution function
ggplot(genes, aes( x = value, color = gene)) + stat_ecdf()
Advantages: lossless; “Plotting the sorted values against their ranks gives the essential features of the ECDF”
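The quoted statement can be checked in a few lines of base R: sorting the values and dividing their ranks by n reproduces what ecdf() returns at the observed values. The simulated data and the seed are assumptions for illustration.

```r
set.seed(42)                        # arbitrary seed for reproducibility
x <- rnorm(50, mean = 10)

sorted <- sort(x)                               # values in increasing order
prop   <- seq_along(sorted) / length(sorted)    # rank / n = fraction of values <= sorted[i]

# The built-in ecdf() gives the same step heights at the observed values
Fn <- ecdf(x)
all.equal(Fn(sorted), prop)

plot(sorted, prop, type = "s", xlab = "value", ylab = "ECDF")
```

No binning or smoothing parameter is involved, which is why the ECDF is lossless: every data point appears as one step.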
## Visualizing data in 2D: Scatterplots
ggplot(gapminder_raw, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  labs(x = "GDP per person", y = "life expectancy") +
  ggtitle("association between gdp per person and life expectancy")
# adjust the transparency (alpha value) to reduce overplotting:
ggplot(gapminder_raw, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.1) +
  labs(x = "GDP per person", y = "life expectancy") +
  ggtitle("association between gdp per person and life expectancy")
# one could also visualize the point density with contour lines (read them like the contour lines of mountains on a map)
ggplot(gapminder_raw, aes(x = gdpPercap, y = lifeExp)) +
  geom_point(alpha = 0.1) +
  labs(x = "GDP per person", y = "life expectancy") +
  geom_density2d() +
  ggtitle("association between gdp per person and life expectancy")
The geom_point geometric object offers the following aesthetics (beyond x and y): fill, color, shape, size and alpha (all already used; alpha was explained above).
Another way to show additional dimensions of the data is to show multiple plots that result from repeatedly subsetting (or “slicing”) the data based on one (or more) of the variables, so that each part can be visualized separately, e.g. with ‘facet_grid(. ~ lineage)’, where lineage is a variable of the book's example data.
Another useful function is ‘facet_wrap’: if the faceting variable has too many levels for all the plots to fit in one row or one column, this function wraps them into a specified number of columns or rows.
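A minimal sketch of both faceting functions. Because the book's dataset is not available here, it uses mpg from ggplot2, with drv and class standing in for a variable such as lineage; these substitutions are assumptions.

```r
library(ggplot2)

base <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()

# facet_grid: one panel per level of drv, all placed in a single row
p_grid <- base + facet_grid(. ~ drv)

# facet_wrap: class has more levels, so the panels are wrapped into 3 columns
p_wrap <- base + facet_wrap(~ class, ncol = 3)

print(p_grid)
print(p_wrap)
```

By default all panels share the same axes, which keeps the subsets directly comparable.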
Interactive graphics:
shiny -> web application framework
ggvis -> extends ggplot2 into the realm of interactive graphics (JavaScript)
plotly ->
#install.packages("plotly")
library("plotly")
## Warning: package 'plotly' was built under R version 4.3.1
##
## Attaching package: 'plotly'
## The following object is masked from 'package:Hmisc':
##
## subplot
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:rtracklayer':
##
## export
## The following object is masked from 'package:IRanges':
##
## slice
## The following object is masked from 'package:S4Vectors':
##
## rename
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
plot_ly(economics, x = ~ date, y = ~ unemploy / pop)
## No trace type specified:
## Based on info supplied, a 'scatter' trace seems appropriate.
## Read more about this trace type -> https://plotly.com/r/reference/#scatter
## No scatter mode specifed:
## Setting the mode to markers
## Read more about this attribute -> https://plotly.com/r/reference/#scatter-mode
rgl -> (TRY THAT ONE !!! THAT IS SO COOL)
data("volcano")
volcanoData = list(
x = 10 * seq_len(nrow(volcano)),
y = 10 * seq_len(ncol(volcano)),
z = volcano,
col = terrain.colors(500)[cut(volcano, breaks = 500)]
)
#install.packages("rgl")
library("rgl")
## Warning: package 'rgl' was built under R version 4.3.1
with(volcanoData, persp3d(x, y, z, color = col))
## Color
We have done this very often already, so I will not explain it in detail.
pie(rep(1, 8), col=1:8)
## Heatmaps
Heatmaps are a powerful way of visualizing large, matrix-like datasets and providing a quick overview of patterns that might be in the data.
# Sample data for the heatmap
matrix_data <- matrix(
c(1, 2, 3, 4, 5, 6, 7, 8, 9), # Example values
nrow = 3, # Number of rows
ncol = 3, # Number of columns
byrow = TRUE # Fill the matrix by row
)
# Create the heatmap
heatmap(matrix_data)
## Data transformations
We already know log and its effect. Another way: rank() --> replaces the original values with their ranks, i.e. their relative position among all values.
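A tiny worked example of what rank() does, with made-up numbers chosen for illustration:

```r
x <- c(10, 1000, 50, 50)

rank(x)    # tied values share the average of their ranks by default
# [1] 1.0 4.0 2.5 2.5

# Ranks do not change under a strictly increasing transformation such as log()
identical(rank(x), rank(log(x)))
```

The extreme value 1000 simply becomes rank 4, so large outliers no longer stretch the axis.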
# Load the required packages
library(ggplot2)
library(dplyr)
# Rank-transform the 'lifeExp' variable
gapminder_raw <- gapminder_raw %>%
mutate(rank_lifeExp = rank(lifeExp))
# Create the scatter plot with rank-transformed y-axis
ggplot(gapminder_raw, aes(x = gdpPercap, y = rank_lifeExp)) +
geom_point() +
labs(x = "GDP per person", y = "Rank of life expectancy") +
ggtitle("Association between GDP per person and life expectancy (Rank Transformation)")
# For comparison:
ggplot(gapminder_raw, aes(x = gdpPercap, y = lifeExp)) +
  geom_point() +
  labs(x = "GDP per person", y = "life expectancy") +
  ggtitle("association between gdp per person and life expectancy")
## Summary:
## Which representation of data frames is used in ggplot (wide or long)? How to convert the other type into the format?
--> The preferred format is the long format (also known as “tidy” format). If your data frame is in the wide format, where the values of one variable are spread over several columns (e.g. one column per product), you can convert it to the long format with functions such as “pivot_longer()” from the tidyr package (part of the tidyverse) or “melt()” from the reshape2 package.
# Sample wide-format data frame
wide_df <- data.frame(
Month = c("Jan", "Feb", "Mar"),
ProductA = c(100, 120, 90),
ProductB = c(80, 70, 110),
ProductC = c(150, 130, 140)
)
print(wide_df)
## Month ProductA ProductB ProductC
## 1 Jan 100 80 150
## 2 Feb 120 70 130
## 3 Mar 90 110 140
# Convert wide-format data frame to long format
library(tidyverse)
long_df <- wide_df %>%
pivot_longer(cols = -Month, names_to = "Product", values_to = "Sales")
# Print the long-format data frame
print(long_df)
## # A tibble: 9 × 3
## Month Product Sales
## <chr> <chr> <dbl>
## 1 Jan ProductA 100
## 2 Jan ProductB 80
## 3 Jan ProductC 150
## 4 Feb ProductA 120
## 5 Feb ProductB 70
## 6 Feb ProductC 130
## 7 Mar ProductA 90
## 8 Mar ProductB 110
## 9 Mar ProductC 140
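Why ggplot prefers the long format becomes clear when plotting: each aesthetic needs exactly one column. The sketch below rebuilds the long sales table by hand (so that it runs on its own) and maps Product to colour; the name long_sales and the plot type are my own choices.

```r
library(ggplot2)

# Long-format version of the sales example, typed in directly
long_sales <- data.frame(
  Month   = rep(c("Jan", "Feb", "Mar"), each = 3),
  Product = rep(c("ProductA", "ProductB", "ProductC"), times = 3),
  Sales   = c(100, 80, 150, 120, 70, 130, 90, 110, 140)
)
# Keep the calendar order instead of the default alphabetical order
long_sales$Month <- factor(long_sales$Month, levels = c("Jan", "Feb", "Mar"))

# One column per aesthetic: Month -> x, Sales -> y, Product -> colour and group
p <- ggplot(long_sales, aes(x = Month, y = Sales, colour = Product, group = Product)) +
  geom_line() +
  geom_point()

print(p)
```

With the wide table, the three product columns could not be mapped to a single y aesthetic without writing one layer per column.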
# Sample data frame with a categorical variable
df <- data.frame(
Category = c("Low", "Medium", "High", "Low", "High"),
Value = c(10, 20, 15, 5, 8)
)
# Define custom order for the Category variable
custom_order <- c("Medium", "Low", "High")
df$Category <- factor(df$Category, levels = custom_order)
# Create the plot
library(ggplot2)
ggplot(df, aes(x = Category, y = Value)) +
geom_bar(stat = "identity") +
labs(x = "Category", y = "Value") +
ggtitle("Custom Order of Categorical Variable")
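Instead of typing the level order by hand, the levels can also be ordered by the data. A minimal sketch with base R's reorder(), using the same toy data; ordering by the sum per category is my own choice, made because geom_bar(stat = "identity") stacks the values of a category.

```r
df <- data.frame(
  Category = c("Low", "Medium", "High", "Low", "High"),
  Value    = c(10, 20, 15, 5, 8)
)

# Order the levels by the total Value per category (Low = 15, Medium = 20, High = 23)
df$Category <- reorder(df$Category, df$Value, FUN = sum)
levels(df$Category)
# [1] "Low"    "Medium" "High"
```

The same call can be used directly inside aes(), e.g. aes(x = reorder(Category, Value, FUN = sum), y = Value).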
## Task 2: Deep dive into ggplot (Focus: geom …)
b) Work out the pros and cons for every plot. Some points which are of special interest:
• How many dimensions are involved in the plot (e.g. in long format how many columns; histogram: one, scatterplot: two)?
• Can you see the whole data or is it summarized in a way?
• What are dangers in this kind of representation for the data?
• Give a use case for every plot (where this plot would fit best).
## Bar plot
Dimensions: two dimensions, one for the categorical variable on the x-axis and one for the corresponding values on the y-axis.
Data Representation: Each bar shows a single number per category (a count, a summary such as the mean, or the raw value with stat = "identity"); the individual observations behind a summary are not visible.
Dangers: A y-axis that does not start at zero exaggerates differences between bars, and a bar that shows only a summary hides the spread of the underlying data.
Use Case: Bar plots in ggplot are suitable for comparing a quantity across different groups or categories.
## Histogram
Dimensions: one dimension, the variable whose distribution is shown.
Data Representation: Histograms provide a summarized view of the data, showing the frequency or density of values within each bin.
Dangers: The picture depends on the number of bins: too few bins over-smooth and hide structure, too many bins show mostly noise.
Use Case: Histograms in ggplot are useful for understanding the distribution and shape of continuous or discrete data, identifying patterns, and detecting outliers.
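The effect of the bin choice can be tried out with base R's hist(). The sketch below uses a simulated sample with two peaks; the data, the seed and the two break values are assumptions for illustration.

```r
set.seed(7)                                          # arbitrary seed for reproducibility
x <- c(rnorm(200, mean = 0), rnorm(200, mean = 4))   # simulated sample with two peaks

coarse <- hist(x, breaks = 2,  plot = FALSE)   # very few bins: the two peaks may blur together
fine   <- hist(x, breaks = 60, plot = FALSE)   # many bins: counts per bin become noisy

length(coarse$counts)   # number of bins actually used (breaks is only a suggestion)
length(fine$counts)

# Draw both versions next to each other
op <- par(mfrow = c(1, 2))
plot(coarse, main = "few bins")
plot(fine, main = "many bins")
par(op)
```

In ggplot the corresponding arguments of geom_histogram() are bins and binwidth.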
## Scatter plot
Dimensions: two dimensions, one for each variable being compared.
Data Representation: The entire dataset is visible, and individual points show the relationship between the two variables.
Dangers: Overplotting can occur if there are too many data points, making it difficult to discern patterns or relationships. Overlapping points may also obscure certain observations.
Use Case: Scatter plots in ggplot are effective for visualizing the relationship between two continuous variables, identifying trends, clusters, or outliers, and detecting correlations.
## Line plot
Dimensions: two dimensions, typically time or another continuous variable on the x-axis and the corresponding variable values on the y-axis.
Data Representation: Line plots show the trend and progression of the data over time or another continuous variable.
Dangers: Connecting the points suggests a continuous change between measurements that may not exist, and a manipulated y-axis scale can make the representation misleading.
Use Case: Line plots in ggplot are ideal for displaying time series data, illustrating trends, changes over time, and comparing multiple trends on the same plot.
## Pie chart
Dimensions: one dimension, representing different categories or groups as different slices of the pie.
Data Representation: The whole dataset is visible, and the size of each slice represents the proportion or percentage of the whole.
Dangers: It is difficult to compare the sizes of different slices accurately, especially when there are many categories.
Use Case: Pie charts in ggplot are useful for displaying proportions and percentages, especially when the number of categories is small and the differences between them are substantial.
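ggplot2 has no dedicated pie geom; a common way to draw one is a single stacked bar shown in polar coordinates. A minimal sketch with made-up shares (the data frame shares and its values are hypothetical):

```r
library(ggplot2)

# Made-up shares, for illustration only
shares <- data.frame(
  group = c("A", "B", "C"),
  share = c(50, 30, 20)
)

# One stacked bar (x = "") wrapped around the circle gives a pie chart
p <- ggplot(shares, aes(x = "", y = share, fill = group)) +
  geom_col(width = 1) +
  coord_polar(theta = "y") +
  theme_void()

print(p)
```

Removing the coord_polar() line shows the same data as a stacked bar, where the shares are usually easier to compare.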
# using a scatter plot to compare the length of transcript and exon
library(ggplot2)
ids <- unique(gtf_df$gene_id)
new_df <- NULL
# for every gene, collect column 4 of gtf_df (the feature length) for its transcript and exon rows
for (e in 1:7126) {
  new_df <- rbind(new_df, data.frame(
    id = ids[e],
    transcript = gtf_df[gtf_df$gene_id == ids[e] & gtf_df$type == "transcript", 4],
    exon = gtf_df[gtf_df$gene_id == ids[e] & gtf_df$type == "exon", 4]))
}
head(new_df)
## id transcript exon
## 1 YDL248W 1152 1152
## 2 YDL247W-A 75 75
## 3 YDL247W 1830 1830
## 4 YDL246C 1074 1074
## 5 YDL245C 1704 1704
## 6 YDL244W 1023 1023
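Growing a data frame with rbind inside a loop gets slow for thousands of genes. As a possible alternative, the same table can be built with two subsets and one merge(). The sketch below runs on a small hypothetical stand-in called toy_gtf (made-up gene ids, only the columns gene_id, type and width), because gtf_df itself is not defined in this section.

```r
# Toy stand-in for gtf_df (hypothetical values; only gene_id, type and width are used)
toy_gtf <- data.frame(
  gene_id = c("G1", "G1", "G2", "G2", "G3", "G3"),
  type    = c("transcript", "exon", "transcript", "exon", "transcript", "exon"),
  width   = c(1152, 1152, 75, 75, 1830, 1830)
)

# Split by feature type, keeping only the gene id and the length
transcripts <- toy_gtf[toy_gtf$type == "transcript", c("gene_id", "width")]
exons       <- toy_gtf[toy_gtf$type == "exon",       c("gene_id", "width")]

# Join the two tables on gene_id; the suffixes distinguish the two width columns
lengths_df <- merge(transcripts, exons, by = "gene_id",
                    suffixes = c("_transcript", "_exon"))
lengths_df
```

Note that a gene with several exons yields one row per exon here, so the result should be checked against the loop version before replacing it.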
g <- ggplot(new_df, aes(transcript, exon))
g + geom_point() +
geom_smooth(method="lm", se=F) +
labs(subtitle="compare the length of transcript and exon",
y="exon",
x="transcript",
title="Scatterplot with overlapping points")
## `geom_smooth()` using formula = 'y ~ x'
# observe the distribution of length density
a <- ggplot(gtf_df, aes(x = width)) + stat_density()
a + geom_area(aes(fill = type), stat ="bin", alpha=0.6) +
theme_classic()
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.